ABSTRACT

A new mapping algorithm for speech recognition relates the features of simultaneous recordings of clean and noisy speech. The model is a piecewise nonlinear transformation applied to the noisy speech feature. The transformation is a set of multidimensional linear least-squares filters whose outputs are combined using a conditional Gaussian model. The algorithm was tested using SRI's DECIPHER™ speech recognition system [1-5]. Experimental results show how the mapping is used to reduce recognition errors when the training and testing acoustic environments do not match.

1.  INTRODUCTION 

In many practical situations an automatic speech recognizer has to operate in several different but well-defined acoustic environments. For example, the same recognition task may be implemented using different microphones or transmission channels. In this situation it may not be practical to recollect a speech corpus to train the acoustic models of the recognizer. To alleviate this problem, we propose an algorithm that maps speech features between two acoustic spaces. The models of the mapping algorithm are trained using a small database recorded simultaneously in both environments.

In the case of steady-state additive homogeneous noise, we can derive an MMSE estimate of the clean speech filterbank log-energy features using a model of how the features change in the presence of this noise [6-7]. In these algorithms, the estimated speech spectrum is a function of the global spectral signal-to-noise ratio (SNR), the instantaneous spectral SNR, and the overall spectral shape of the speech signal. However, after studying simultaneous recordings made with two microphones, we believe that the relationship between the two simultaneous features is nonlinear. We therefore propose to use a piecewise-nonlinear model to relate the two feature spaces.

1.1.  Related  Work  on  Feature  Mapping 

Several algorithms in the literature have focused on experimentally training a mapping between the noisy features and the clean features [8-13]. The proposed algorithm differs from previous algorithms in several ways:

• The MMSE estimate of the clean speech features in noise is trained experimentally rather than with a model as in [6, 7].

• Several frames are joined together, similar to [13].

• The conditional PDF is based on a generic noisy feature not necessarily related to the feature that we are trying to estimate. For example, we could condition the estimate of the cepstral energy on the instantaneous spectral SNR vector.

• Multidimensional least-squares filters are used for the mapping transformation. This exploits the correlation of the features over time and among components of the spectral features at the same time.

•  Linear  transformations  are  combined  together  without  hard 
decisions. 

• All delta parameters are computed after mapping the cepstrum and cepstral energy.

•  The  mapping  parameters  are  trained  using  stereo  recordings 
with  two  different  microphones.  Once  trained,  the  mapping 
parameters  are  fixed. 

•  The  algorithm  can  either  map  noisy  speech  features  to  clean 
features  during  training,  or  clean  features  to  noisy  features 
during  recognition. 

1.2.  Related  Work  on  Adaptation 

The  algorithm  used  to  map  the  incoming  features  into  a  more 
robust  representation  has  some  similarities  to  work  on  model 
adaptation.  Some  of  the  high-level  differences  between  hidden 
Markov  model  (HMM)  adaptation  and  the  mapping  algorithms 
proposed  in  this  paper  are: 

• The mapping algorithm works primarily by correcting shifts in the mean of the feature set that are correlated with observable information. Adapting HMM model parameters has certain degrees of freedom that the mapping algorithm does not have, for example the ability to change state variances and mixture weights.

• Two HMM states that have identical probability distributions and are not tied can have different distributions after adaptation. These distributions cannot be differentiated by mapping features.

• The mapping algorithms described in this paper are able to incorporate many pieces of information that have been traditionally difficult to incorporate into HMM models and into adaptation algorithms. These include observations that span several frames and the correlation of the state features with global characteristics of the speech waveform.

These  two  techniques  are  not  mutually  exclusive  and  can  be  used 
together  to  achieve  robust  speech  recognition  performance.  The 
boundary  between  these  two  techniques  can  be  blurred  when  the 
mapping  algorithm  is  dependent  on  the  speech  recognizer’s 
hypothesis. 


2.  THE  POF  ALGORITHM 

The mapping algorithm is based on a probabilistic piecewise-nonlinear transformation of the acoustic space that we call Probabilistic Optimum Filtering (POF). Let us assume that the recognizer is trained with data recorded with a high-quality close-talking microphone (clean speech), and that the test data is acquired in a different acoustic environment (noisy speech). Our goal is to estimate a clean feature vector $x_n$ given its corresponding noisy feature $y_n$, where $n$ is the frame index. (A list of symbols is shown in Table 1.) To estimate the clean vector we vector-quantize the clean feature space into $I$ regions using the generalized Lloyd algorithm [14]. Each VQ region is assigned a multidimensional transversal filter (see Figure 1). The error between the clean vector and the estimated vector produced by the i-th filter is given by

$$e_{n,i} = x_n - \hat{x}_{n,i} = x_n - W_i^T Y_n \qquad (1)$$

where $e_{n,i}$ is the error associated with region $i$, $W_i$ is the filter coefficient matrix, and $Y_n$ is the tapped-delay line of the noisy vectors. Expanding these matrices we get

$$W_i^T = [\, A_{i,-p} \; \cdots \; A_{i,-1} \; A_{i,0} \; A_{i,1} \; \cdots \; A_{i,p} \; b_i \,] \qquad (2)$$

$$Y_n^T = [\, y_{n-p}^T \; \cdots \; y_{n-1}^T \; y_n^T \; y_{n+1}^T \; \cdots \; y_{n+p}^T \; 1 \,] \qquad (3)$$

Figure 1: Multi-dimensional transversal filter for cluster i.

The conditional error in each region is defined as

$$E_i = \sum_{n=p}^{N-1-p} \| e_{n,i} \|^2 \, p(g_i \mid z_n) \qquad (4)$$

where $p(g_i \mid z_n)$ is the probability that the clean vector $x_n$ belongs to region $g_i$ given an arbitrary conditioning noisy feature vector $z_n$. Note that the conditioning noisy feature can be any acoustic vector generated from the noisy speech frame. For example, it may include an estimate of the SNR, energy, cepstral energy, cepstrum, and so forth.
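As an illustration of the tapped-delay line described above, the following minimal Python sketch stacks the noisy frames from n-p to n+p and appends a constant 1 that multiplies the additive tap. The function name `tap_input_vector` and the toy frame values are hypothetical and chosen only for this example; frames near the edges of an utterance are simply rejected here, which is an assumption rather than something stated in the text.

```python
def tap_input_vector(frames, n, p):
    """Stack noisy frames n-p..n+p plus a trailing 1 into one tap vector.

    frames: list of feature vectors (each a list of L floats).
    The result has (2p+1)*L + 1 entries, matching the tap input
    vector dimension listed in Table 1.
    """
    if n - p < 0 or n + p >= len(frames):
        raise IndexError("frame n needs p neighbours on each side")
    taps = []
    for k in range(n - p, n + p + 1):
        taps.extend(frames[k])
    taps.append(1.0)  # constant input paired with the additive tap b_i
    return taps


# Toy data (hypothetical): 5 frames with L = 2 components each.
frames = [[float(10 * t + d) for d in range(2)] for t in range(5)]
Y = tap_input_vector(frames, n=2, p=1)
print(Y)
```

With L = 2 and p = 1 the vector has (2p+1)L+1 = 7 entries.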

The conditional probability density functions $p(z_n \mid g_i)$ are modeled as a mixture of $I$ Gaussian distributions, one per VQ region. The parameters of the distributions (mean vectors and covariance matrices) are estimated using the corresponding $z_n$ vectors associated with that region. The posterior probabilities $p(g_i \mid z_n)$ are computed using Bayes' theorem, and the mixture weights $P(g_i)$ are estimated using the relative number of training clean vectors that are assigned to a given VQ region.
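The posterior computation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the covariances are taken to be diagonal (as in the experiments reported later), the computation is done in the log domain for numerical stability, and the names `log_gaussian_diag` and `region_posteriors` as well as the toy parameters are hypothetical.

```python
import math


def log_gaussian_diag(z, mean, var):
    """Log density of a Gaussian with diagonal covariance at point z."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (x - m) ** 2 / v
        for x, m, v in zip(z, mean, var)
    )


def region_posteriors(z, weights, means, variances):
    """Posterior of each VQ region given z, via Bayes' theorem."""
    log_joint = [
        math.log(w) + log_gaussian_diag(z, m, v)
        for w, m, v in zip(weights, means, variances)
    ]
    peak = max(log_joint)  # subtract the maximum before exponentiating
    unnorm = [math.exp(lj - peak) for lj in log_joint]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Toy model (hypothetical): two regions in a 2-dimensional conditioning space.
weights = [0.5, 0.5]
means = [[0.0, 0.0], [4.0, 4.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
post = region_posteriors([0.0, 0.0], weights, means, variances)
mid = region_posteriors([2.0, 2.0], weights, means, variances)
print(post, mid)
```

A point at the first mean is assigned almost entirely to the first region, while a point halfway between the two means is split evenly, which is the soft decision the mapping relies on.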


Symbol       Dimension                      Description
n            1                              frame index
i            1                              region index
L            1                              feature vector size
M            1                              conditioning feature vector size
N            1                              number of training frames
I            1                              number of VQ regions
p            1                              maximum filter delay
e_{n,i}      L x 1                          estimation error vector
x_n          L x 1                          clean feature vector
\hat{x}_n    L x 1                          estimate of clean feature vector
y_n          L x 1                          noisy feature vector
z_n          M x 1                          conditioning noisy feature vector
\mu_i        M x 1                          mean vector of Gaussian i
\Sigma_i     M x M                          covariance matrix of Gaussian i
W_i          ((2p+1)L+1) x L                transversal filter coefficient matrix
Y_n          ((2p+1)L+1) x 1                tap input vector
A_{i,k}      L x L                          multiplicative tap matrix
b_i          L x 1                          additive tap matrix
R_i          ((2p+1)L+1) x ((2p+1)L+1)      auto-correlation matrix
r_i          ((2p+1)L+1) x L                cross-correlation matrix

Table 1: List of symbols


To compute the optimum filters in the mean-squared error sense, we minimize the conditional error in each VQ region. The minimum mean-squared error vector is obtained by taking the gradient of $E_i$ defined in Eq. (4) with respect to the filter coefficient matrix and equating all the elements of the gradient matrix to zero. As a result, the optimum filter coefficient matrix has the form

$$W_i = R_i^{-1} r_i, \quad \text{where} \quad R_i = \sum_{n=p}^{N-1-p} Y_n Y_n^T \, p(g_i \mid z_n) \qquad (5)$$

is a probabilistic nonsingular auto-correlation matrix, and

$$r_i = \sum_{n=p}^{N-1-p} Y_n x_n^T \, p(g_i \mid z_n) \qquad (6)$$

is a probabilistic cross-correlation matrix.

The  algorithm  can  be  completely  trained  without  supervision  and 
requires  no  additional  information  other  than  the  simultaneous 
waveforms. 
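The training step for one region amounts to a posterior-weighted least-squares fit. The sketch below is a minimal plain-Python illustration, not the authors' implementation: it accumulates the weighted auto-correlation and cross-correlation matrices and solves the resulting linear system by Gaussian elimination. The helper names `solve` and `train_filter` and the toy data are hypothetical; a real system would normally use a linear-algebra library instead of the hand-written solver.

```python
def solve(A, B):
    """Solve A X = B by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [list(A[r]) + list(B[r]) for r in range(n)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        if abs(M[piv][c]) < 1e-12:
            raise ValueError("auto-correlation matrix is singular")
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * b for a, b in zip(M[r], M[c])]
    return [[M[r][n + j] / M[r][r] for j in range(len(B[0]))] for r in range(n)]


def train_filter(taps, clean, post):
    """Weighted least-squares filter for one region.

    taps[n]  : tap input vector for frame n (length D)
    clean[n] : clean feature vector for frame n (length L)
    post[n]  : posterior of this region given the conditioning feature
    Returns the D x L coefficient matrix.
    """
    D, L = len(taps[0]), len(clean[0])
    R = [[0.0] * D for _ in range(D)]  # weighted auto-correlation
    r = [[0.0] * L for _ in range(D)]  # weighted cross-correlation
    for Y, x, w in zip(taps, clean, post):
        for a in range(D):
            for b in range(D):
                R[a][b] += w * Y[a] * Y[b]
            for j in range(L):
                r[a][j] += w * Y[a] * x[j]
    return solve(R, r)


# Toy data (hypothetical): L = 1, p = 0, and clean = 2 * noisy + 3 exactly.
ys = [0.0, 1.0, 2.0, 3.0]
taps = [[y, 1.0] for y in ys]
clean = [[2.0 * y + 3.0] for y in ys]
post = [1.0, 0.5, 0.2, 1.0]
W = train_filter(taps, clean, post)
print(W)
```

Because the toy relation is exactly affine, the fit recovers a multiplicative tap of 2 and an additive tap of 3 regardless of the posterior weights.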

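The run-time estimate of the clean feature vector can be computed by integrating the outputs of all the filters, each weighted by the posterior probability of its region:

$$\hat{x}_n = \sum_{i=1}^{I} W_i^T Y_n \, p(g_i \mid z_n) \qquad (7)$$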
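A minimal Python sketch of this run-time combination is given here, assuming that the outputs of the per-region filters are summed with the region posteriors as weights (a soft combination without hard decisions). The function name `pof_estimate` and the toy filter values are hypothetical.

```python
def pof_estimate(Y, filters, post):
    """Posterior-weighted sum of the outputs of all region filters.

    Y       : tap input vector (length D)
    filters : list of D x L coefficient matrices, one per region
    post    : posterior probability of each region for this frame
    """
    L = len(filters[0][0])
    x_hat = [0.0] * L
    for W, w in zip(filters, post):
        for j in range(L):
            # j-th output of this region's filter, i.e. (W^T Y)[j]
            out = sum(W[a][j] * Y[a] for a in range(len(Y)))
            x_hat[j] += w * out
    return x_hat


# Toy setup (hypothetical): two regions, L = 1, p = 0, so Y = [y, 1].
filters = [
    [[1.0], [0.0]],  # region 0: identity mapping, no offset
    [[2.0], [1.0]],  # region 1: gain of 2, offset of 1
]
x_hat = pof_estimate([3.0, 1.0], filters, [0.25, 0.75])
print(x_hat)
```

Here the two filters output 3 and 7, so the weighted estimate is 0.25 * 3 + 0.75 * 7 = 6.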


3.  EXPERIMENTS 

A series of experiments shows how the mapping algorithm can be used in a continuous speech recognizer across acoustic environments. In all of the experiments the recognizer models are trained with data recorded with high-quality microphones and digitally sampled at 16,000 Hz. The analysis frame rate is 100 Hz.

The  tables  below  show  three  types  of  performance  indicators: 

•  Relative  distortion  measure.  For  a  given  component  of  a 
feature  vector  we  define  the  relative  distortion  between  the 
clean  and  noisy  data  as  follows: 


•  Word  recognition  error. 

• Error ratio. The error ratio is given by $E_n / E_c$, where $E_n$ is the word recognition error for the test-noisy/train-clean condition, and $E_c$ is the word recognition error of the test-clean/train-clean condition.

3.1.  Single  Microphone 

To test the POF algorithm on a single target acoustic environment we used the DARPA Wall Street Journal database [15] on SRI's DECIPHER™ phonetically tied-mixture speech recognition system [2]. The signal processing consisted of a filterbank-based front end that generated six feature streams: cepstrum (c1-c12), cepstral energy (c0), and their first- and second-order derivatives. Cepstral-mean normalization [16] was used to equalize the channel. We used simultaneous recordings of high-quality speech (Sennheiser 414 head-mounted microphone with a noise-canceling element) along with speech recorded by a standard speaker phone (AT&T 720) and transmitted over local telephone lines. We will refer to this stereo data as clean and noisy speech, respectively. The models of the recognizer were trained using 42 male WSJ0 training talkers (3500 sentences) recorded with a Sennheiser microphone. The models of the mapping algorithm were trained using 240 development training sentences recorded by three speakers. The test set consisted of 100 sentences (not included in the training set) recorded by the same three speakers.

In this experiment we mapped two of the six features: the cepstrum (c1-c12) and the cepstral energy (c0) separately. The derivatives were computed from the mapped vectors of the cepstral features. For the conditioning feature we used a 13-dimensional cepstral vector (c0-c12) modeled with 512 Gaussians with diagonal covariance matrices. The results are shown in Table 2.


Filter Coefficients                  Average       Recognition   Error
                                     Distortion    Error (%)     Ratio
No mapping                           0.72          27.6          2.46
A_{i,0} = I, b_i                     0.62          18.1          1.62
A_{i,0}, b_i                         0.57          17.0          1.52
A_{i,-1}, ..., A_{i,1}, b_i          0.51          17.3          1.54
A_{i,-2}, ..., A_{i,2}, b_i          0.50          16.4          1.46
A_{i,-3}, ..., A_{i,3}, b_i          0.49          15.9          1.42
A_{i,-4}, ..., A_{i,4}, b_i          0.49          16.1          1.44

Table 2: Performance of the POF algorithm for different numbers of filter coefficients. The number of Gaussian distributions is 512 per feature and the conditioning feature is a 13-dimensional cepstral vector.

The baseline experiment produced a word error rate of 27.6% on the noisy test set, that is, 2.46 times the error obtained when using the clean data channel. A 34% relative improvement in recognition performance was obtained when using only the additive filter coefficient b_i (recognition error goes down to 18.1%). The best result (15.9% recognition error) was obtained for the condition p=3, in which six neighboring noisy frames are used, together with the current frame, to estimate the feature vector for the current frame. The correlation between the average relative distortion between the six clean and noisy features and the recognition error is 0.9.

3.2.  ATIS  Simultaneous  Corpus

To test the performance of the POF algorithm on multiple microphones we used SRI's stereo-ATIS database. (See [1] for details.) A corpus of both training and testing speech was collected using simultaneous recordings made from subjects wearing a Sennheiser HMD 414 microphone and holding a telephone handset. The speech from the telephone handset was transmitted over local telephone lines during data collection. Ten different telephone handsets were used. Ten male speakers were designated as training speakers, and three male speakers were designated as the test set. The training set consisted of 3,000 simultaneous recordings of Sennheiser microphone and telephone speech. The test set consisted of 400 simultaneous recordings of Sennheiser and telephone speech. The results obtained with this pilot corpus are shown in Table 3.

Acoustic Model Training              Test Set Word Error (%)
Training Data    Front-End           Sennheiser    Telephone
                 Bandwidth
Sennheiser       Wide                7.8           19.4
Sennheiser       Telephone           9.0           9.7
Telephone        Telephone           10.0          10.3

Table 3: Effect of different training and front-end bandwidth on test set performance. Results are word error rate on the 400-sentence simultaneous test set.


We can see from Table 3 that there is a 15.4% relative decrease in performance when using a telephone front end (7.8% increases to 9.0% word error) and testing on Sennheiser data. This is due to the loss of information in reducing the bandwidth from 100-6400 Hz to 300-3300 Hz. However, when we are using a telephone front end, there is only a 7.8% relative increase in word error when testing on telephone speech compared to testing on Sennheiser speech (9.7% versus 9.0%). This is a very surprising result: we had expected a much bigger performance difference when Sennheiser models are tested on telephone speech acoustics.

3.3.  Multiple  Microphones:  Single  or  Multiple  Mapping 

The POF mapping algorithm can be used in a number of ways when the microphone is unknown. Some of these variations are shown in Table 4.


Experiment                                              Streams    Word
                                                        Averaged   Error (%)
Single mapping combining all 10 telephones              -          9.4
  in training data
Train 10 mappings, one for each telephone;              -          9.2
  run 10 recognizers in parallel, each using
  a different mapping; select the recognizer
  with the highest probability
Train 10 mappings, one for each telephone;              Top1       9.3
  run 10 mappings in parallel and average               Top2       9.2
  the features of the best N feature streams            Top3       8.9
  that have the highest likelihood                      Top4       8.7
Train 15 mappings for the WSJ corpus;                   Top1       9.8
  run 15 mappings in parallel and average               Top2       9.6
  the features of the best N feature streams            Top3       10.3
  that have the highest likelihood                      Top4       10.7

Table 4: Performance on the multiple-telephone handset test set when the mapping algorithm is used in different ways.


The differences between the experimental conditions are small, but the trends are different and depend on the mapping and the corpus. These differences depend on the similarities of the different microphones that are used in training conditions, and the relationship between the training and the testing conditions.

When the microphones are all similar (10 telephone mappings), averaging the features of each mapping helps improve performance. When the microphones are very different (e.g., those in the WSJ corpus), the word error reaches a minimum when the two best (highest-likelihood) feature streams are averaged, and it increases when more streams are added.

3.4.  Multiple  Microphones:  Conditioning  Feature 

The next experiment varied the conditioning feature. The conditioning feature is the feature vector used to divide the space into different acoustic regions. In each region of the acoustic space a different linear transformation is trained.

The mapping approach was fixed: we used a single POF mapping for multiple telephone handsets. For this experiment we mapped the cepstrum vector (c1-c12) and the cepstral energy (c0). The maximum delay of the filters was kept fixed at p=2, and the number of Gaussians was 512. The experimental variable was the feature on which the estimates were conditioned. We tried the following conditioning features:

• Cepstrum. The same conditioning feature used in the single-microphone experiment (c0-c12).

• Spectral SNR. This is an estimate of the instantaneous signal-to-noise ratio computed in the log-filterbank energy domain. The vector size is 25.

• Cepstral SNR. This feature is generated by applying the discrete cosine transform (DCT) to the spectral SNR. The transformation reduces the dimensionality of the vector from 25 to 12 elements.
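The cepstral SNR feature can be illustrated with a short Python sketch. Several details are assumptions made only for this example and are not stated in the text: a DCT-II is used, the zeroth coefficient is dropped so that coefficients 1 to 12 are kept (by analogy with c1-c12), and no scaling factor is applied. The function name `cepstral_snr` is hypothetical.

```python
import math


def cepstral_snr(spectral_snr, n_keep=12):
    """Unscaled DCT-II of a spectral SNR vector, keeping coefficients 1..n_keep.

    Assumption: the zeroth (average) coefficient is discarded; the source
    only states that the dimension is reduced from 25 to 12.
    """
    K = len(spectral_snr)
    return [
        sum(s * math.cos(math.pi * q * (k + 0.5) / K)
            for k, s in enumerate(spectral_snr))
        for q in range(1, n_keep + 1)
    ]


# A flat 25-channel SNR profile (hypothetical values) has no spectral shape,
# so every retained coefficient is numerically zero.
flat = cepstral_snr([10.0] * 25)
# A tilted profile puts most of its energy in the first coefficient.
tilted = cepstral_snr([float(k) for k in range(25)])
print(len(flat), max(abs(c) for c in flat), tilted[0])
```

The transform decorrelates the 25 channel SNRs and keeps only the smooth variation across frequency, which is one way to read the compact representation of noise and spectral shape mentioned in the results.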

The results are shown in Table 5. The baseline result is a 19.4% word error rate. This result is achieved when the same wide-band front end is used for training the models with clean data and for recognition using telephone data. When a telephone front end [1] is used for training and testing, the error decreases to 9.7%. The disadvantage of using this approach is that the acoustic models of the recognizer have to be reestimated. However, the POF-based front end operates on the clean models and results in better performance. The cepstral SNR produces the best result (8.7%). With this conditioning feature we combine the effects of noise and spectral shape in a compact representation.


Experiment                           Word          Error
                                     Error (%)     Ratio
Wide-band front-end                  19.4          2.49
Telephone-bandwidth front-end        9.7           1.24
Mapping with cepstrum                9.4           1.20
Mapping with spectral SNR            8.9           1.14
Mapping with cepstral SNR            8.7           1.11

Table 5: Performance for the multiple-telephone handset test set when varying the conditioning feature.
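The error ratios in Table 5 can be checked against the definition of the error ratio with a few lines of Python. This sketch assumes that the clean reference is the 7.8% word error reported in Table 3 for Sennheiser test data with the wide-band front end; that pairing is an inference from the numbers, and the names `error_ratio` and `table5_errors` are hypothetical.

```python
def error_ratio(noisy_error, clean_error):
    """Word error in the mismatched condition divided by the clean error."""
    return noisy_error / clean_error


CLEAN_ERROR = 7.8  # assumed reference: Sennheiser test, wide-band front end

table5_errors = {
    "Wide-band front-end": (19.4, 2.49),
    "Telephone-bandwidth front-end": (9.7, 1.24),
    "Mapping with cepstrum": (9.4, 1.20),
    "Mapping with spectral SNR": (8.9, 1.14),
    "Mapping with cepstral SNR": (8.7, 1.11),
}

ratios = {}
for name, (word_error, reported) in table5_errors.items():
    ratios[name] = error_ratio(word_error, CLEAN_ERROR)
    print(f"{name}: computed {ratios[name]:.3f}, reported {reported:.2f}")
```

Under this assumption every computed ratio agrees with the reported value to within 0.01.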

4.  WSJ  EXPERIMENTAL  RESULTS

Another series of experiments was performed on the WSJ Speech Corpus [15]. We evaluated our system on the 5000-word-recognition closed-vocabulary speaker-independent speech-recognition tasks: Spoke S5: Unknown Microphone, Spoke S6: Known Microphone, and Spoke S7: Noisy Environment.

The version of the DECIPHER speaker-independent continuous speech recognition system used for these experiments is based on a progressive-search strategy [3] and continuous-density, genonic HMMs [2]. Gender-dependent models are used in all passes. Gender selection uses the models with the higher recognition likelihood.

The acoustic models used by the HMM system were trained with 37,000 sentences of Sennheiser data from 280 speakers, a set officially designated as the WSJ0+WSJ1 many-speaker baseline training. A 5,000-word closed-vocabulary back-off trigram language model provided by M.I.T. Lincoln Laboratory for the WSJ task was used. Gender-dependent HMM acoustic models were used.

The front-end processing extracts one long spectral vector consisting of the following six feature components: cepstrum, energy, and their first- and second-order derivatives. The dimensionality of this feature is 39 (13 × 3) for the wide-bandwidth spectral analysis and 27 (9 × 3) for the telephone-bandwidth spectral analysis. The cepstral features are computed from an FFT filterbank, and subsequent cepstral-mean normalization on a sentence-by-sentence basis is performed.

Before using wide-bandwidth context-dependent genonic HMMs, a robust estimate of the Sennheiser cepstral parameters is computed using POF. The robust front-end analysis is designed for an unknown microphone condition. The POF mapping algorithm estimates are conditioned on the noisy cepstral observations. Separate mappings are trained for each of the 14 microphones in the baseline WSJ0+WSJ1 si_tr_s stereo training, and one mapping for the overall case of single nontelephone mapping. When the default no-transformation zero-mean cepstra are included, this makes a total of 15 estimated feature streams. These feature streams are computed on each test waveform, and the two feature streams with the highest likelihoods (using a simplified HMM for scoring the features) are averaged together (Top2). In all cases the first and second delta parameters are computed on these estimated cepstral values.
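The Top-N combination of feature streams can be sketched as follows. This is a minimal illustration that assumes one likelihood score per stream has already been computed; the simplified HMM used for scoring is not reproduced, and the function name `average_top_streams` and the toy values are hypothetical.

```python
def average_top_streams(streams, log_likelihoods, top_n=2):
    """Average, frame by frame, the top_n streams with the highest scores.

    streams         : list of feature streams; each stream is a list of
                      frames and each frame is a list of floats
    log_likelihoods : one score per stream (assumed precomputed)
    """
    order = sorted(range(len(streams)),
                   key=lambda s: log_likelihoods[s], reverse=True)
    best = order[:top_n]
    n_frames = len(streams[0])
    dim = len(streams[0][0])
    return [
        [sum(streams[s][t][d] for s in best) / len(best) for d in range(dim)]
        for t in range(n_frames)
    ]


# Toy example (hypothetical): three mapped streams, two frames, one component.
streams = [
    [[1.0], [2.0]],
    [[3.0], [4.0]],
    [[10.0], [20.0]],
]
scores = [-5.0, -3.0, -9.0]
top2 = average_top_streams(streams, scores, top_n=2)
top1 = average_top_streams(streams, scores, top_n=1)
print(top2, top1)
```

With top_n=1 the procedure reduces to selecting the single most likely mapping, while larger values blend the outputs of several mappings before the delta parameters are computed.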


Front-End      Signal Processing        Test Set       Word
Bandwidth                                              Error (%)
Wide           Standard                 Sennheiser     5.8
Telephone      Standard                 Sennheiser     9.6
Telephone      Standard                 Telephone      10.9
Wide           Robust POF15             Telephone      11.9
               Cepstral Mapping

Table 6: Performance on the Aug 1993 WSJ Spoke S6 development test set for simultaneous Sennheiser/telephone recordings


The results in Table 6 show that most of the loss in performance between recognizing high-quality Sennheiser recordings and local telephone speech is due to the loss of information outside the telephone bandwidth. On the Sennheiser recordings, the word error rate increases by 66% when the wide-bandwidth analysis (5.8%) is replaced by the telephone-bandwidth analysis (9.6%).

The loss in performance when switching from Sennheiser recordings to telephone recordings is small in comparison to the loss due to bandwidth restrictions. With the telephone-bandwidth analysis, the word error rate increases by 14% when testing on the AT&T telephone recordings (10.9%) rather than on the Sennheiser recordings (9.6%).

4.1.  Official  Spoke  Results:  Unknown  Microphone 

The results in Table 7 show the speech recognition performance when the secondary microphone condition is unknown. In these experiments, the robust signal processing front end decreased the word error rate from 17.2% to 13.1%.


Experiment                   Word Error (%)
                             Sennheiser    Secondary
                                           Microphone
Compensation Disabled        6.6           17.2
Compensation Enabled         6.6           13.1

Table 7: Word error rate with and without compensation on both Sennheiser and secondary microphone data


4.2.  Official  Spoke  Results:  Known  Microphone

The results in Table 8 show no significant difference in speech recognition performance between the results obtained with an Audio-Technica microphone and those obtained with the Sennheiser microphone. The robust front-end signal processing has demonstrated for the first time that one can achieve the same performance with a stand-mounted microphone as with a high-quality close-talking microphone, all when trained on a high-quality speech corpus.


Experiment                        Word Error (%)
                                  Sennheiser    Secondary
                                                Microphone
Audio-Technica Recordings         5.9           6.4
Telephone Handset Recordings      7.2           19.1

Table 8: Word Error for both Sennheiser and Secondary Microphone with Robust Signal Processing Front End


4.3.  Official  Spoke  Results:  Noisy  Environment

The results in Table 9 show the performance when the recordings are made in a noisy environment. The first noisy environment was a computer room (average background noise level of 58 to 59 dBA), and the second noisy environment was a laboratory with mail sorting equipment (average noise level varied from 62 to 68 dBA). The error rates are significantly higher for the Audio-Technica microphone than for the Sennheiser microphone in the noisier environment. In the computer room environment, the performance with the Audio-Technica microphone is almost indistinguishable from that of the Sennheiser recording.


Experiment                       Environment    Word Error (%)
                                                Sennheiser    Secondary
                                                              Microphone
Audio-Technica Recordings        Env 1          6.3           8.5
                                 Env 2          9.1           17.4
Telephone Handset Recordings     Env 1          8.4           29.1
                                 Env 2          8.3           28.8

Table 9: Word Error for both Sennheiser and Secondary Microphone with Robust Signal Processing Front End when Recorded in Two Noisy Environments


5.  CONCLUSIONS 

We have presented a feature-mapping algorithm capable of exploiting nonlinear relations between two acoustic spaces. We have shown how to improve the performance of the recognizer in the presence of a noisy signal by using a small database with simultaneous recordings in the clean and noisy acoustic environments.

We  have  shown  that 

• There is no significant difference in speech recognition performance between the results obtained with an Audio-Technica microphone and those obtained with a Sennheiser microphone. There is no significant performance degradation in a quiet environment and only a slight degradation in low-noise environments (~59 dBA).

• Multidimensional least-squares filters can be successfully used to exploit the correlation of the features over time and among components of the spectral features at the same time. These filters can be conditioned on both local and global spectral information to improve robust recognition performance.

• Most of the performance loss in converting wide-bandwidth models to telephone speech models is due to the loss of information associated with the telephone bandwidth.

•  It  is  possible  to  construct  acoustic  models  for  telephone 
speech  using  a  high-quality  speech  corpus  with  only  a  minor 
increase  in  recognition  word  error  rate. 

• A telephone-bandwidth system trained with high-quality speech can outperform a system that is trained on telephone speech even when tested on telephone speech.

• The variability introduced by the telephone handset does not degrade speech recognition performance.

• Robust signal processing can be designed to maintain speech recognition performance using wide-bandwidth HMM models with a telephone-bandwidth test set.